Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data
Abstract
Faced with the rapid growth in the amount of available structured data (such as XML documents), many algorithms have emerged for dealing with tree-structured data. In this article, we present a probabilistic approach that aims to prune, a priori, noisy or irrelevant subtrees in a set of trees. The originality of this approach, compared with classic data reduction techniques, is that only a part of a tree (i.e. a subtree) can be deleted, rather than the whole tree. Our method is based on confidence intervals, computed on a partition of subtrees according to a given probability distribution. We propose an original approach to assessing these intervals on tree-structured data, and we show experimentally that it is useful in the presence of noise.
Similar resources
Probabilistic Approach for Reduction of Irrelevant Tree-structured Data
This article aims at pruning noisy or irrelevant subtrees in a set of trees. The originality of this approach, in comparison with classic techniques in prototype selection, comes from deleting not the whole tree but only some of its subtrees. Our method is based on the computation of confidence intervals on a set of subtrees according to a probability distribution. We propose a...
Approximate Tree Kernels
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation techni...
Mining Maximal Frequent Subtrees based on Fusion Compression and FP-tree
It is commonly accepted that mining frequent subtrees plays a pivotal role in areas like Web log analysis, XML document analysis, semi-structured data analysis, as well as biometric information analysis, chemical compound structure analysis, etc. An improved algorithm, i.e. the MFPTM algorithm, which is based on fusion compression and the FP-tree principle, was proposed in this paper to determine a better w...
Learning Metrics Between Tree Structured Data: Application to Image Recognition
The problem of learning metrics between structured data (strings, trees or graphs) has been the subject of various recent papers. With regard to the specific case of trees, some approaches focused on the learning of edit probabilities required to compute a so-called stochastic tree edit distance. However, to reduce the algorithmic and learning constraints, the deletion and insertion operations ...
A hybrid model based on machine learning and genetic algorithm for detecting fraud in financial statements
Financial statement fraud has increasingly become a serious problem for business, government, and investors. In fact, this threatens the reliability of capital markets, corporate heads, and even the audit profession. Auditors in particular face their apparent inability to detect large-scale fraud, and there are various ways to identify this problem. In order to identify this problem, the majori...
Journal title:
- Fundam. Inform.
Volume: 66
Publication date: 2005